Prediction-oriented prognostic biomarker discovery with survival machine learning methods

Abstract Identifying novel and reliable prognostic biomarkers for predicting patient survival outcomes is essential for deciding personalized treatment strategies for diseases such as cancer. Numerous feature selection techniques have been proposed to address the high-dimensional problem in constructing prediction models. Not only does feature selection lower the data dimension, but it also improves the prediction accuracy of the resulted models by mitigating overfitting. The performances of these feature selection methods when applied to survival models, on the other hand, deserve further investigation. In this paper, we construct and compare a series of prediction-oriented biomarker selection frameworks by leveraging recent machine learning algorithms, including random survival forests, extreme gradient boosting, light gradient boosting and deep learning-based survival models. Additionally, we adapt the recently proposed prediction-oriented marker selection (PROMISE) to a survival model (PROMISE-Cox) as a benchmark approach. Our simulation studies indicate that boosting-based approaches tend to provide superior accuracy with better true positive rate and false positive rate in more complicated scenarios. For demonstration purpose, we applied the proposed biomarker selection strategies to identify prognostic biomarkers in different modalities of head and neck cancer data.


INTRODUCTION
The fast advancement of genomic sequencing and other high-throughput molecular profiling technologies has made it possible to characterize tens of thousands of genes and biomarkers at the same time. Due to the high dimensionality and complex data dependence structure, deploying genome-wide biomarker screening has posed both statis-tical and computational challenges. In statistics and machine learning modeling, the search for biomarkers can be viewed as feature selection or variable selection in a r egr ession model. The purpose of this article is to discuss the analysis of prognostic biomarkers with the main goal of identifying markers that are associated with and can be predictable for patient survival outcomes. Multivariable survival analysis is complicated by the problem of censoring, which occurs when the patient survival time after the date of diagnosis is only known in part. Another critical but frequently o verlook ed fact is that all survival models are built based on certain distributional assumptions about survival time.
The Cox proportional hazards r egr ession model ( 1 ) has been one of the most widely used tools for survival analysis in biomedical r esear ch, but its traditional implementation is not suitable for analyzing modern genomic data, where the number of candidate biomarkers is much greater than the sample size. Various feature selection techniques have been de v eloped to address the high-dimensional prob lem. One of the most commonly used techniques is to apply penalization constraints to the original likelihood or objecti v e functions and thereby generate sparse solutions, such as the lasso ( 2 ) and elastic net ( 3 ) penalty terms. By incorporating lasso or elastic net with a combined cross-validation (CV) and stability selection (SS) procedure for parameter turning, Kim et al. ( 4 ) proposed a method called predictionoriented marker selection (PROMISE). Another popular solution for solving high-dimensional problems in machine learning is to use tree-based boosting algorithms such as extreme gradient boosting (XGB) ( 5 ) and light gradient boosting (L GB) ( 6 ), w hich are well known for their high computa tional ef ficiency and exceptional predictive performance when accommodating nonlinear effects. Both XGB and LGB have been recently extended and evaluated in the context of a survival model ( 7 ). Based on the deep learning ar chitectur e, the stud y by Ka tzman et al. ( 8 ) proposed a Cox deep neural network (DeepSurv) method, where the Cox likelihood loss was implemented in a multilayer feedforwar d networ k. While indi vidual benchmar k studies hav e reported promising results, there are few studies directly comparing the performances of these adv anced surviv al models in biomarker selection and survival outcome prediction.
The purpose of this article is to conduct a systematical evaluation of the performances of most widely used survival machine learning methods under a variety of scenarios, with a particular emphasis on predicti v e biomar ker prioritization. Our work is motivated by an increasing demand for guidance and standar ds regar ding the deployment of prognostic biomark er disco very analysis pipelines in cancer r esear ch, wher e survival outcomes ar e one of the most critical clinical outcomes or endpoints. Fi v e r epr esentati v e machine learning approaches were chosen as a result. In general, the primary goal of most machine learning methods is to provide high prediction accuracy. Howe v er, the featur e selection r esults may not be clearly clarified, especially in nonlinear machine learning models. Many machine learning software packages provide feature importance as an option for feature selection, but they do not gi v e a threshold to determine the final decision. Although the feature selection results can be directly determined by the coefficients in linear models, the results can still become unreliable due to excessive false positives. To reduce the number of false positi v es and keep the high prediction accuracy, the study by Kim et al. ( 4 ) proposed PROMISE that combines CV and SS together in generalized linear models. To deal with the survival data, we extend PROMISE from the generalized linear model to the Cox model ( PROMISE-Cox). Furthermore, in order to identify the significant features in nonlinear machine learning models, we propose a pr ediction-oriented featur e selection algorithm that combines CV and topk selection. The description of the proposed algorithm with pseudo-code can be found in Supplementary Table S2. We use 'XGB-Cox-select', 'LGB-Cox-select', 'RSF-select' and 'DeepSurv-select' to denote the machine learning-based selection models. For the sake of convenience, we use only the machine learning models' names to r epr esent the machine learning-based variab le selection frame wor k throughout this paper. We implement the feature selection procedure in 'Xsurv' with function 'x select'. Different machine learning methods can be used in 'x select' by choosing the corresponding option in 'method' argument. In this study, we focus on fiv e different machine learning methods, which can be mainly divided into two categories: linear model and nonlinear model. A summarization for the different machine learning methods is presented in Figure 1 based on simulations and experience in a previous work ( 7 ). We evalua te dif ferent machine learning approaches in four simulation scenarios: (i) linear model; (ii) quadratic model; (iii) nonlinear model without interactions and (iv) nonlinear model with interactions. Finally, we demonstrate the utility of these methods by applying them to different modalities of head and neck cancer data, followed by discussion.

Cox model
In this paper, we focus on the Cox proportional hazards ( 1 ) based model. In survival analysis, we consider the time-toe v ent (e.g. time to death) data ( t i , x i , δ i ) , where t i is the ob- where λ 0 ( t) is the baseline hazard and H ( ·) is a risk score function determined by the covariates. In the standard Cox model, the risk score is expressed as a linear function, i.e.
To estimate the r egr ession coefficient β, the loss function is defined by the negati v e log partial likelihood (PL) plus a penalty: is the penalty function, such as the L1 penalty or an elastic net penalty. Here, R( t i ) is the set of the observations at risk at time t i .

Prediction-oriented selection method
In general, the primary goal of most machine learning methods is to provide high prediction accuracy. Howe v er, the feature selection results may not be clearly clarified, especially in nonlinear machine learning models. Many machine learning software packages provide feature importance as an option for feature selection, but they do not gi v e a threshold to determine the final decision. Although the feature selection results can be directly determined by the coefficients in linear models, the results can still become unreliable due to excessive false positives. To reduce the number of false positi v es and keep the high prediction accuracy, the study by Kim et al. ( 4 ) proposed PROMISE that combines CV and SS together in generalized linear models. To deal with the survival data, we extend PROMISE from the generalized linear model to the Cox model (PROMISE- Cox). Furthermore, in order to identify the significant features in nonlinear machine learning models, we propose a pr ediction-oriented featur e selection algorithm that combines CV and topk selection. The description of the proposed algorithm with pseudo-code can be found in Supplementary Table S1. We implement the feature selection procedure in 'Xsurv' ( 7 ) with function 'x select'. Different machine learning methods can be used in 'x select' by choosing the corresponding option in 'method' argument. In this study, we focus on fiv e different machine learning methods, which can be mainly divided into two categories: linear model and nonlinear model. A summarization for the different machine learning methods is presented in Figure 1 . To convey the main idea of different machine learning methods, we gi v e a brief description of them in the following.
The model for PROMISE-Cox is trained by a linear objecti v e function with a lasso (L1) ( 2 ) or elastic net penalty ( 3 ). The lasso penalty can be written as and the elastic net penalty is a linear combination of L1 and L2 (ridge penalty) ( 9 ) penalties: The significant features or markers are identified with nonzero r egr ession coefficients. The pseudo-code for PROMISE-Cox can be found in Supplementary Table S1. We implement the PROMISE-Cox model in the R package 'Xsurv' in function 'x proms'.
Decision tree is a widely used machine learning technique for nonlinear models. In this paper, we introduce two representati v e decision tree models: gradient boosting decision tree and random forests. XGB and LGB are two modern gradient boosting models and have been implemented for survival data recently ( 7 ). Both XGB-Cox and LGB-Cox use the negati v e partial likelihood as the loss function. In gradient boosting, the loss function is iterati v ely optimized by finding a weak learner that is nearest to the negati v e gradients. We denote the gradient and second deri vati v e of the Cox PL loss by g i and s i , respecti v ely. The optimal basis function at the m th step η ( m ) can be written as Hence, the risk score function can be updated by where is the learning rate. In practice, is often set around 0.001 and we provide the frequently used searching values for in Table 3 . In particular, high computa tional ef ficiency is achie v ed with boosting tree frame-works ( 7 ). Furthermore, to avoid overfitting, XGB / LGB incorporate the regularization terms into the loss function. The main difference between XGB and LGB is that they use differ ent tr ee construction strategies. XGB uses the le v elwise strategy, while LGB employs a leaf-wise growth strategy for tree construction and the gradient-based one-side sampling to force a split. Although XGB with the Cox loss is implemented in the 'xgboost' package with the objecti v e option 'survival:cox', it is limited by the evaluation metric that only contains the PL loss function. In addition, there is no particular function for 'LightGBM' package on survival outcomes. We use the functions from 'Xsurv' package, in which both XGB and LGB are implemented on survival outcomes with different evaluation metrics [the PL loss and concor dance inde x ( C -inde x)]. Random forest ( 10 ) is another type of decision tree that predicts the results with the entir e for est. Random survival for ests (RSF) ( 11 ) is an extension of random forests in time-to-e v ent survi val data. In an RSF, each tree is constructed independently with a randomly drawn bootstrap sample. Different from PROMISE, XGB and LGB, RSF modeling does not depend on the Cox loss function. We use the function 'rfsrc' in the R package 'randomForestSRC' in our study. The splitting rule used by the package is the log-rank test statistic (12)(13)(14). The maximization of the log-rank split-statistic value ensures the largest survival difference between left and right daughter nodes.
Over the last decade, deep learning methods that are based on artificial neural networks draw a growing attention in machine learning field for the abilities in analyzing the structures of high-dimensional data in various areas such as image recognition (15)(16)(17)(18). To apply the deep learning methods in survival data, a Cox model-based neural network called DeepSurv was proposed by Katzman et al. ( 8 ). DeepSurv is a feedforward neural network that is trained by the Cox PL with regularization terms as the objecti v e function. The risk score function H ( ·) is then the function of network weights. The hidden layers of DeepSurv are constructed with a fully connected layer of nodes. To pre v ent the overfitting issue, each layer is followed by a dropout layer ( 19 ). The final output of DeepSurv is a single node that estimates the risk score function H ( ·) . In addition, Deep-Surv proposed a tr eatment r ecommender system that can be used to predict H ( ·) by gi v en treatment groups in a clinical study. To improve the network performance, several deep learning techniques have been employed in DeepSurv, such as scaled exponential linear units ( 20 ), adaptive moment estimation ( 21 ) and learning rate scheduling ( 22 ). We use the function deepsurv from the R package 'survivalmodels' to fit DeepSurv networks.
In general, different machine learning methods have their own strengths and limitations. The knowledge of the algorithms is important in order to decide the best suited method based on the application. We summarize some main advantages and disadvantages of the machine learning methods we introduced in this section in Table 1 .

Hyperparameter tuning
One of the most important things for machine learning methods is how to deal with the hyperparameter tuning. The optimal parameters are determined by the CV procedure included in feature selection algorithms for PROMISE-Cox and nonlinear machine learning models. Two parameter searching schemes are applied: grid search and random search. In particular, some machine learning models are not very sensitive to some specific hyperparameters, such as L1 / L2 regularization terms in XGB-Cox or LGB-Cox. Ther efor e, we can improve the computational efficiency by reducing the size of searching space. We offer the recommended setting of parameter values for different hyperparameters of the machine learning methods in Table 2 .

Evaluation metrics
Because of the censoring of survival data, it is difficult to e valuate survi val models ( 23 ). To assess the survival model performance, ther e ar e some evaluation metrics proposed for survival analysis. The C -index ( 24 ) is the most widely used evaluation metric for survival anal ysis. Specificall y, the C -index measures the prediction accuracy from concordant pairs. The definition of C -index is given by where P is the set of or derab le pairs and t i < t j . C -index takes the value from 0 to 1, and a higher C -index indicates a better prediction performance.
Another commonly used evaluation metric is integrated Brier score (IBS) ( 25 ). Brier score (BS) can be calculated as where ˆ S ( ·) is the survival function predicted by the model and ˆ G ( ·) is the survival function corresponding to censoring, i.e. ˆ G ( t) = P ( Cen > t) , where Cen is the censoring time. IBS is then defined as the integral form of BS: In this study, to demonstrate the idea of the feature selection frame wor k, we pr esent the r esults using C -index as the evaluation metric.

Simulations
We generated four simulation scenarios with varied model complexity. We begin with the Cox model with a simple linear link function in scenario 1 (linear model). More specifically, let covariates X = X 1 , . . . , X p be i.i.d. standard normal distributed random variables, then the failure time T follows an exponential distribution with mean at exp ( X 1 + X 2 + · · · + X q ) , where p is the total number of features and q is the number of true signals. The censoring time C follows an exponential distribution with a mean of q. Scenario 2 (quadratic model) is based on a quadratic function; i.e. the failure time T follows an exponential distribution with mean exp (1 / 2)( X 2 1 + X 2 2 + · · · + X 2 q ) . In scenario 3 (nonlinear model without interactions), the failure time T follows an exponential distribution with a mean of cos X 9 + X 2 10 − 1 ] + X 11 + · · · + X q } , where is the standard normal cumulati v e distribution function. Here, 10 ≤ q p. The censoring time has a 1 / 3 chance to be 0.02 and a 2 / 3 chance to be uniform (0, 0.02), and the censoring rate in this case is ∼30%. In scenario 4 (nonlinear model with interactions), the failure time T follows a Weibull distribution with a mean of sin X 9 · X 2 10 − 1 ] + X 11 + · · · + X q } . Similar to the simulations described above, the failure time T for risk group classification simulation follows a Weib ull distrib ution, with a shape parameter of 2 and the scale parameter The risk le v el is determined by the value of υ with a larger value of υ corresponding to a worse survival (Supplementary Figure S2). Based on the values of υ, patients can be classified as high-risk group or low-risk group. The training data consisted of 1000 random samples, whereas the test data consisted of 200 samples.

RESULTS
In this section, we assess the performances of discussed machine learning approaches for survival analysis using a suite of simulation settings and demonstrate the applications of these methods using a head and neck cancer dataset. In addition to prognostic predica tion, evalua ting the ef fecti v eness of biomar ker identification in either targeted or genome-wide screening is an important component of our attention.

Prognostic feature selection
The results of feature selection when we set N = 1000, p = 100 and q = 10 are shown in Figure 2 . We use the true positi v e rate (TPR) and the false positi v e rate (FPR) to characterize the performance of feature selection: TPR = # true signals selected # total true signals , FPR = # false signals selected # total false signals .
Overall, all approaches have yielded high TPR and low FPR in a low-dimensional setting. The r esults r e v ealed a larger discrepancy as the model complexity increased from scenario 1 to scenario 4. As expected, PROMISE, which only allows linear effects, exhibited the best performance in terms of TPR and FPR in the linear setting (scenario 1) but achie v ed the worst performance in nonlinear scenarios when compared to other methods. The four machine learning approaches (XGB, LGB, RSF and DeepSurv) achie v ed comparable outcomes in scenarios 2-4, with L GB slightl y outperforming the others, followed by XGB and DeepSurv. Howe v er, when the sample size is between 500 and 1000, it is difficult to tell the method with the optimal TRP and FPR. The fitted line in Figure 3 illustrates the pattern of the different machine learning methods in TPR and FPR as the featur e size incr eases. As expected, the feature selection performances of all methods decrease as the feature dimension increases, but L GB a ppears to be the most robust method. Similar to the low-dimensional scenario, PROMISE-Cox outperforms other methods in scenario 1, whereas the other methods exhibit a similar pattern as shown in Figure 3 . Collecti v el y, our sim ulations show that LGB and XGB tend to achie v e superior performances when the training sample size is limited.

Pr ediction perf ormance
We evaluate the prediction performances of various models under four different scenarios: linear model (scenario 1), quadratic model (scenario 2), nonlinear model without interactions (scenario 3) and nonlinear model with interactions (scenario 4). In each simulation, 80% of samples were used as training data and 20% as test data. For each scenario, 100 replicates were generated. Figure 4 summarizes the results with a sample size N = 500, feature number p = 1000 and number of true signals q = 20. The benchmark comparison for C -index in the test data is reported in Figure  4 A. As expected, PROMISE-Cox was consistently the topranked method in scenario 1 (Figure 4 B and C). In all other sim ulation scenarios, L GB outperformed the other methods on survival pr ediction (Figur e 4 A and B) and feature selection (Figure 4 C). As expected, PROMISE yielded the lowest C -index and the worst selection results after including more nonlinear components in scenarios 3 and 4. We also conducted simulation studies with N = 1000, p = 2000  Figure S2). With the addition of more interaction terms in scenario 4, DeepSurv emerged as the overall best performing method (Supplementary Figure S1).

Risk group classification
In this section, we further study the performances of patient risk group classification based on the following machine learning methods: XGB, LGB, RSF and DeepSurv. In simula ted da ta, 2000 fea tur es wer e generated from a normal distribution with three groups of means μ i : μ 1 = 5 , μ 2 = 10 and μ 3 = 15 , with each group containing 400 samples.
As shown in Figure 5 A, among all machine learning methods, LGB yields the best misclassification rate. As expected, the predicted survival probability in the 'low-risk' group is significantly better than the 'high-risk' group in all machine learning methods (Figure 5 B), with LGB demonstrating a clearer distinction between two groups.

Identifying prognostic biomarkers in HNSCC with different modalities
Here, using The Cancer Genome Atlas head and neck squamous cell carcinoma (HNSCC) dataset, we demonstrate the use of survival machine learning methods for biomarker prioritiza tion. Pa tients with three clinical covariates (age, sex and stage) and 15 878 long noncoding RNAs (lncR-NAs), 1406 microbiome covariates, 20 518 mRNAs and 25 101 Copy Number alterations (CNAs) are included in the discovery datasets. The sample size for each modality is 499, 514, 515 and 516, respecti v ely. The survi val outcome had a censoring rate of around 55% for different modalities. For each different category, we randomly select 80% of samples of the original data as training data and the remaining 20% samples as the test da ta. W hen evalua ting the performance of various machine learning methods, we focused on C -index and IBS of the test dataset. As shown in Figure 6 D, LGB yielded the best C -index in the extremely high-dimensional cases for lncRNA, mRNA and CNA. Similar results regarding the performances of different methods on IBS are shown in Supplementary Figur e S3, wher e LGB had the smallest values for extr emely high-dimensional cases and XGB performed best on microbiome modality for a moderate dimensional case. We highlighted the overlapped top 3 markers that were selected by multiple methods in Table 3 . LncRNAs have emerged as potential prognostic biomarkers for predicting therapeutic outcomes in cancer ( 26 , 27 ). Notably, the results from LGB-based models contain more overlapped lncRNAs than any other methods, demonstrating their high reliability on biomarker selection, while the XGB-based model outperforms the other methods with moderate feature dimension ( p = 1406). As shown in Figure 6 A, higher gene expression values of RP11.147L13.8 and LINC00482 are associated with better survival in HNSCC patients, while a higher value of LINC01338 predicts worse survival. As illustrated in Figure 6 B, survival curves for patient subgroups stratified according to the LGB model's predicted risk groups demonstrate a significant disparity in surviv al with P -v alue < 0.0001. The SHAP v alue plot ( Figure  6 C) ( 28 ) re v eals that RP11.147L13.8 and LINC01338 are also listed among top 3 prognostic biomarkers. Their prognostic values have also been reported in other cancer types ( 29 , 30 ). For other modalities, some significant biomarkers recognized for HNSCC in recent studies, such as EGFR ( 31 ) and USP14 ( 32 ), were identified by the LGB-based models, and both EGFR and USP14 show negati v e influence on patients' survival with larger values (Figure 6 A).

DISCUSSION
In this study, we compared the performance of the prediction-oriented prognostic biomarker selection framework based on five machine learning methods: PROMISE-Co x, XGB-Co x, LGB-Co x, RSF and DeepSurv. Our find-   ings suggest that the prediction-based biomarker selection strategy is a viable option for biomark er disco very. Consistent with previous findings ( 33 , 34 ), PROMISE-Cox outperforms other approaches in linear models, while XGB-Cox and LGB-Cox perform better in the nonlinear highdimensional scenarios. Additionally, our simulation results indica te tha t L GB-Cox is recommended w hen the sample size is < 500 and the feature number is > 1000, whereas DeepSurv tends to deli v er better results when the sample size is further increased (e.g. when n > 2000). Gi v en the fact that most cancer genomic studies have sample size < 1000, we recommend use of boosting-based methods as the main frame wor k for prognostic biomarker discovery in oncology.
It is important to acknowledge the fact that hyperparameter tuning during the model deployment can have a significant impact on the final performance. In this aspect, RSF and PROMISE-Cox methods do provide advantages in terms of the number of hyperparameters, although the computational burden in PROMISE-Cox will increase con-siderabl y w hen mor e r esampling steps ar e incorporated into the nested CV and SS steps. While XGB-Cox and LGB-Cox have more hyperparameters than RSF and PROMISE, they ar e mor e manageable than deep learning methods in the parameter tuning burden. Based on both our experience and results from this study, we recommend no more than eight layers in DeepSurv (including four dense layers and four dropout layers) and consider less than four layers in the initial run when the sample size is between 250 and 1000. To ensur e r epr oducibility, we have pr ovided functions in the package Xsurv ( 7 ) that enables the automatic parameter tuning process.
Modern supervised machine learning methods are designed to deli v er the predicti v e models efficiently by incorporating the training data as a whole. Due to their blackbox nature, the performances of these new methods in reliably selecting true signal features or prioritizing candidate biomarkers, especially in the context of survival models, have been largely understudied. Although some soft-ware packages can provide metrics on feature importance (such as information gain or SHAP values in XGB / LGB, feature importance in RSF and importance scores in deep learning models), the information on the number of selected features in the final model is often not readily accessible. The proposed pr ediction-oriented featur e selection framework combines the machine learning methods with a robust feature selection procedure based on CV and the topk selection procedure. We demonstra te tha t this frame wor k outperfor ms PROMISE in ter ms of feature selection and pr ediction r esults in nonlinear model settings. In comparison to conventional machine learning methods for survival data, the proposed frame wor k provides more meaningful insight into the importance of the candidate biomarkers.
Although we focused on the biomarker identification based on the Cox models, the prediction-oriented framework combined with machine learning methods can be extended to a variety of other survival models, including the accelerated failure time (AFT) model ( 35 ), censoring unbiased deep learning ( 36 ) and alternati v e deep learning methods such as deep survival machines ( 37 ). The AFT model, in particular, has been implemented in the Xsurv package for implementing XGB-Cox and LGB-Cox, and is thus readily available and utilized to fulfill the proposed framework.

DA T A A V AILABILITY
HNSCC data underlying this article are available in the Cancer Genome Atlas Program(TCGA) database. Data can be found in PanCancer Atlas publication pages ( https://gdc.cancer.gov/about-da ta/publica tions/ pancanatlas ) and can be downloaded using cBioPortal a t ( https://www.cbioportal.org/stud y/summary?id= hnsc tcga pan can atlas 2018 ). Code and scripts used to generate the results are available in the Zenodo doi: ( https://doi.org/10.5281/zenodo.7991272 ).